KAFKA-6501: Dynamic broker config tests updates and metrics fix #4539
Conversation
@hachikuji Can you review, please? Thank you!
Thanks for the patch. Left one question and a couple minor comments.
val node = nodeMap.get(listenerName)
warn(s"Broker endpoint not found for broker $brokerId listenerName $listenerName")
node
}.getOrElse(None)
nit: I think you don't need this if you use `flatMap`.
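The `flatMap` suggestion can be sketched as follows. This is a hypothetical, self-contained illustration with simplified types, not the actual MetadataCache code:

```scala
object FlatMapSketch {
  // Simplified stand-in for aliveNodes: brokerId -> (listenerName -> endpoint).
  val aliveNodes: Map[Int, Map[String, String]] =
    Map(0 -> Map("PLAINTEXT" -> "localhost:9092"))

  // Original style: map produces Option[Option[String]], so a trailing
  // getOrElse(None) is needed to flatten it.
  def lookupWithMap(brokerId: Int, listener: String): Option[String] =
    aliveNodes.get(brokerId).map(_.get(listener)).getOrElse(None)

  // flatMap collapses the nested Option directly; no getOrElse(None) required.
  def lookupWithFlatMap(brokerId: Int, listener: String): Option[String] =
    aliveNodes.get(brokerId).flatMap(_.get(listener))

  def main(args: Array[String]): Unit = {
    assert(lookupWithMap(0, "PLAINTEXT") == lookupWithFlatMap(0, "PLAINTEXT"))
    assert(lookupWithFlatMap(1, "PLAINTEXT").isEmpty)
  }
}
```

Both forms return the same `Option[String]`; `flatMap` just avoids the nested `Option`.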
val numProcessors = servers.head.config.numNetworkThreads * 2 // 2 listeners

val kafkaMetrics = servers.head.metrics.metrics().keySet.asScala
  .filter(_.tags.containsKey("networkProcessor"))
There is also a response queue size metric which uses the "processor" tag. Maybe we can add a check to ensure its deletion as well?
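The suggested broader check could look like the following sketch. The tag names come from the discussion above, but the metric plumbing here is a simplified stand-in, not the actual test code:

```scala
object MetricTagSketch {
  // Simplified representation of a registered metric: name plus tags.
  final case class MetricName(name: String, tags: Map[String, String])

  val metrics: Set[MetricName] = Set(
    MetricName("io-ratio", Map("networkProcessor" -> "0")),
    MetricName("response-queue-size", Map("processor" -> "0")),
    MetricName("request-rate", Map.empty)
  )

  // Filtering on both tags ensures the response queue size metric
  // (tagged "processor") is also checked for deletion.
  def perProcessorMetrics: Set[MetricName] = metrics.filter { m =>
    m.tags.contains("networkProcessor") || m.tags.contains("processor")
  }

  def main(args: Array[String]): Unit =
    assert(perProcessorMetrics.map(_.name) == Set("io-ratio", "response-queue-size"))
}
```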
aliveNodes.get(brokerId).map { nodeMap =>
  nodeMap.getOrElse(listenerName,
    throw new BrokerEndPointNotAvailableException(s"Broker `$brokerId` does not have listener with name `$listenerName`"))
To check my understanding: previously, when we raised this exception, the Metadata request would have failed with an unknown server error (since this exception has no error code), which would probably have been raised to the user. Is that right? Now we will return LEADER_NOT_AVAILABLE instead and the client will retry.
I am wondering in this case if we really should have a separate error code to indicate that there is no listener provided so that we can at least log a warning in the client. It seems more likely that this is the result of a misconfiguration than a delayed config update.
Yes, we should be careful here. Otherwise we may create a hard-to-diagnose problem for a common case (a misconfigured listener).
@hachikuji @ijuma Thanks for the reviews. At the moment, as Jason said, the user sees an unknown server error which is not retried, and neither client nor broker logs anything to show what went wrong. I did initially consider adding a new error code for this, but the problem is that old clients wouldn't recognize it, and I thought they wouldn't retry as a result (maybe I am wrong). So LEADER_NOT_AVAILABLE seemed a reasonable error to send to the client. Since the problem occurs only if some brokers have a listener and others don't, I was thinking a log entry in the broker logs would be sufficient.
@rajinisivaram if we bumped the version of the relevant protocols in the 1.1 cycle, we could conditionally return a new error code and fall back to LEADER_NOT_AVAILABLE otherwise. If we didn't bump them, then it's less clear whether it's worth doing just for this case.
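The version-gated fallback being discussed could be sketched like this. The error names and the version threshold here are illustrative placeholders, not the actual Kafka error codes or protocol versions:

```scala
object ErrorFallbackSketch {
  sealed trait ApiError
  case object ListenerNotFound extends ApiError   // hypothetical new error code
  case object LeaderNotAvailable extends ApiError // existing retriable error

  // Clients on a bumped protocol version would get the dedicated error;
  // older clients fall back to the retriable LEADER_NOT_AVAILABLE.
  def errorFor(requestVersion: Int, minVersionForNewError: Int): ApiError =
    if (requestVersion >= minVersionForNewError) ListenerNotFound
    else LeaderNotAvailable

  def main(args: Array[String]): Unit = {
    assert(errorFor(6, 6) == ListenerNotFound)
    assert(errorFor(5, 6) == LeaderNotAvailable)
  }
}
```

This is why the version bump matters: without it, there is no way to know whether the client will understand the new code.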
We don't seem to have bumped the version of `MetadataRequest` or `FindCoordinatorRequest` in 1.1 (not sure if there are other requests which use this code).
Hmm, that's a fair point. Something else to consider is whether we should log the message when we receive the UpdateMetadata request from the controller rather than the Metadata request from the clients.
@hachikuji Yes, that makes sense, updated. Do we still want to change protocol version and add a new error code for 1.1?
@hachikuji @rajinisivaram How about we add it to the KIP but implement it in the next version? I believe we have KIPs in progress that suggest changing MetadataRequest and we could piggyback on one of them.
That sounds good to me. I think this is still an improvement over existing behavior.
@hachikuji @ijuma Sounds good. I will update the KIP and create a JIRA for the next version. I think all the other comments on this one have been addressed. Let me know if anything else needs to be done for 1.1. Thanks.
@@ -740,6 +740,7 @@ private[kafka] class Processor(val id: Int,
    close(channel.id)
  }
  selector.close()
  removeMetric("IdlePercent", Map("networkProcessor" -> id.toString))
Can we add a constant for the metric name?
We don't seem to use constants for other metric names; it looks odd to have one just for this?
I don't see any other metrics in this method. Generally, if the same magic value is used in two places, we should definitely use a constant. We don't follow this rule consistently, which is a shame.
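The suggested constant could look like this sketch. The constant name and the key-building helper are hypothetical; the real `Processor` code may name and scope things differently:

```scala
object MetricNameSketch {
  // Shared constant referenced at both the registration and the removal site,
  // so a typo cannot silently leave the metric behind on shutdown.
  val IdlePercentMetricName = "IdlePercent"

  // Stand-in for a metric registry key: name plus tags.
  def metricKey(name: String, tags: Map[String, String]): String =
    name + tags.map { case (k, v) => s",$k=$v" }.mkString

  def main(args: Array[String]): Unit = {
    val registered = metricKey(IdlePercentMetricName, Map("networkProcessor" -> "0"))
    val removed = metricKey(IdlePercentMetricName, Map("networkProcessor" -> "0"))
    assert(registered == removed)
  }
}
```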
    throw new BrokerEndPointNotAvailableException(s"Broker `$brokerId` does not have listener with name `$listenerName`"))
}
val node = nodeMap.get(listenerName)
warn(s"Broker endpoint not found for broker $brokerId listenerName $listenerName")
Hmm, shouldn't this be logged if `node` is `None`?
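The fix being asked for can be sketched as follows; this uses simplified types and a `println` stand-in for `warn`, not the actual broker code:

```scala
object WarnOnMissingSketch {
  // Simplified stand-in for the per-broker listener -> endpoint map.
  val nodeMap: Map[String, String] = Map("PLAINTEXT" -> "localhost:9092")

  def getNode(brokerId: Int, listenerName: String): Option[String] = {
    val node = nodeMap.get(listenerName)
    // Warn only on the failure path, not unconditionally.
    if (node.isEmpty)
      println(s"Broker endpoint not found for broker $brokerId listenerName $listenerName")
    node
  }

  def main(args: Array[String]): Unit = {
    assert(getNode(0, "SSL").isEmpty)          // warns
    assert(getNode(0, "PLAINTEXT").isDefined)  // silent
  }
}
```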
Force-pushed from c67a5c8 to 3a6ed70.
@rajinisivaram Note that the test failure appears related.
@hachikuji Yes, thank you, I will fix the test failure and rebase.
Force-pushed from 3a6ed70 to d005c79.
LGTM, and thanks for fixing the metric that I broke.
1. Handle listener-not-found in MetadataCache since this can occur when listeners are being updated. To avoid breaking clients, this is handled in the same way as broker-not-available so that clients may retry.
2. Set retries=1000 for listener reconfiguration tests to avoid transient failures when the metadata cache has not been updated.
3. Remove IdlePercent metric when Processor is deleted; add test.
4. Reduce log segment size used during reconfiguration to avoid timeout while waiting for log rolling.
5. Test markPartitionsForTruncation after fetcher thread resize.
6. Move per-processor ResponseQueueSize metric back to RequestChannel.

Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>